Speaker Identity Using the Convolution Neural Network based Deep Learning Model

Author : Shamala Palaniappan
Abstract
Speaker recognition can be divided into speaker identification and speaker verification. The two subfields differ slightly in how they are defined and used. Given a voice input, the goal of speaker verification is to authenticate a speaker by answering the question "Is this the voice of a particular person?" Speaker identification instead tries to answer the question "Whose voice is this?" Verification may be considered a special case of open-set identification. This work proposes a deep learning model that uses a convolution neural network (CNN) to identify speakers. The voice input is not constrained to particular words, so the method is text-independent, which is a more difficult setting than a text-dependent system. In the proposed technique, every 2 seconds of the speaker's voice is converted into a spectrogram image and fed to a CNN model that is trained from scratch. The proposed CNN-based method is compared with a classic feature-extraction method in which Mel-frequency cepstral coefficients (MFCCs) are classified by a support vector machine (SVM); MFCCs are the most common feature-extraction technique for audio and speech signals to date. The spectrogram-image input is also compared with a CNN model trained on images of the raw signal waveform. Experiments are performed on the speech of five speakers of the Thai language, with the voices extracted from YouTube. The results show that the proposed CNN-based method trained on spectrogram images performs best of the three methods. The average result on the test set is 95.83% for the proposed method, 91.26% for the MFCC method, and only 49.77% for the CNN model trained on raw waveform images. The proposed technique is very effective even when only a brief speech utterance is used as input.
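The preprocessing step described in the abstract, cutting the voice into 2-second pieces and turning each piece into a spectrogram, might look roughly like the following sketch. It is only an illustration under stated assumptions: the abstract does not give the sample rate, frame length, hop size, window type, or whether segments overlap, so the values below (non-overlapping segments, 256-sample Hann frames, hop of 128) and the helper names `split_into_segments` and `log_spectrogram` are hypothetical choices, not the author's settings.

```python
import numpy as np


def split_into_segments(signal, sample_rate, segment_seconds=2.0):
    """Cut a 1-D signal into non-overlapping fixed-length segments.

    Trailing samples that do not fill a whole segment are dropped
    (an assumption; the abstract does not say how remainders are handled).
    """
    signal = np.asarray(signal, dtype=float)
    seg_len = int(round(segment_seconds * sample_rate))
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)


def log_spectrogram(segment, frame_len=256, hop=128):
    """Log-magnitude short-time Fourier transform of one segment.

    Returns an array of shape (frequency bins, time frames).
    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(segment) - frame_len) // hop
    frames = np.stack(
        [segment[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) on silent frames.
    return 20.0 * np.log10(magnitude + 1e-10).T


if __name__ == "__main__":
    sample_rate = 8000  # assumed sample rate
    t = np.arange(5 * sample_rate) / sample_rate
    voice = np.sin(2 * np.pi * 440 * t)  # synthetic stand-in for a recording
    segments = split_into_segments(voice, sample_rate)
    spectrograms = [log_spectrogram(seg) for seg in segments]
    print(len(spectrograms), spectrograms[0].shape)
```

Each resulting 2-D array could then be saved or rescaled as an image and passed to a CNN classifier; the network architecture itself is not described in the abstract and is left out here.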
Keywords : Convolution neural network (CNN), deep learning, speaker recognition, text-independent, Mel-frequency cepstral coefficients (MFCCs), support vector machine (SVM), speaker identification.
Volume 3 | Issue 3
DOI :